1. Introduction:

In this world, we have to strive to make lives easier for people. This involves, among many things, reducing physical and mental pain. As residents of NYC and avid travellers, we chose to work on the NYPD Motor Vehicle Collisions dataset because we were interested in finding the features that affect vehicular accidents and hence safe travel. The dataset includes many variables (geolocation, time, date, reason, injuries, deaths, etc.) for each reported accident in NYC. We wanted to test different hypotheses based on our assumptions to see whether we could find any patterns, and hopefully come up with general explanations that help in understanding accidents in NYC overall. To give a brief example of how we investigated the data: one of our hypotheses was that the number of accidents would increase during rush hours. To test it, we counted collisions by hour and found that they indeed peaked in the morning and evening hours on weekdays. Based on this pattern, we then looked at the map to see whether the locations where collisions occurred during rush hours matched any congestion areas in NYC. The locations found this way can be highly impactful in the right hands (the police, the mayor’s office, etc.). Similarly, each team member postulated several “theories”, collected evidence (data), explored the data maze, and drew conclusions about the initial assumptions.

Description on how each member contributed to the project:

Enough self-appreciation, let’s move on to the actual study.

2. Data Description

We collected the NYPD Motor Vehicle Collisions dataset, provided by the New York City Police Department (NYPD), from the NYC OpenData website. The dataset covers every reported vehicle accident in NYC, with the date, time, location (city, borough, precinct, and cross street), the number of people injured and killed, contributing factors, and vehicle types. Under Local Law #11, passed in 2011, the data is run manually every month and reviewed by the TrafficStat Unit before it is released on the NYPD website. The dataset can be accessed at https://data.cityofnewyork.us/Public-Safety/NYPD-Motor-Vehicle-Collisions/h9gi-nx95 and the data dictionary is here: https://data.cityofnewyork.us/api/views/h9gi-nx95/files/b5fd8e71-ca48-4e96-bf63-1b8a7c4cc47b?download=true&filename=Collision_DataDictionary.xlsx.

One interesting feature of our data is that it breaks down the number of people injured and killed by category: motorists, cyclists, and pedestrians. Such detail enables a more in-depth analysis and comparisons across scenarios.

Most features are self-explanatory, except “CONTRIBUTING FACTOR VEHICLE 1,” “CONTRIBUTING FACTOR VEHICLE 2,” and so on. These give, for each vehicle involved, the factor that contributed to the accident. There should be at least one contributing vehicle.

We also incorporated historical weather data, where ‘AWND_m_s’ stands for wind speed, ‘PRCP’ for precipitation amount, ‘SNOW’ for the amount of snowfall in mm, ‘SNWD’ for snow depth, and so on. We then converted some of the categorical variables, such as “rain” and “fog”, to 0 and 1. (See https://github.com/Srinidhi-kv/EDAV_Project)

In addition, we used date functions to derive more labels, such as a weekday/weekend flag, the hour, and the day of the week.

These transformations and labels enriched our analysis.
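
The date-based labels described above could be derived along these lines. This is a minimal sketch on a toy data frame, not the project's actual code: the `DATE` and `TIME` formats (`%m/%d/%Y` and `H:MM`) and the derived column names `hour`, `day_of_week` and `is_weekend` are assumptions for illustration.

```r
# Toy stand-in for the collisions data; the column formats are assumed.
collisions <- data.frame(
  DATE = c("03/05/2018", "03/10/2018", "03/11/2018"),
  TIME = c("8:15", "23:40", "0:05"),
  stringsAsFactors = FALSE
)

# Parse the date string into a Date object.
collisions$date <- as.Date(collisions$DATE, format = "%m/%d/%Y")

# Hour of day as an integer 0-23 (the text before the colon in TIME).
collisions$hour <- as.integer(sub(":.*$", "", collisions$TIME))

# Day of week as 0 = Monday ... 6 = Sunday ("%u" gives 1 = Monday ... 7 = Sunday).
collisions$day_of_week <- as.integer(format(collisions$date, "%u")) - 1L

# Weekend flag: 1 for Saturday or Sunday, 0 otherwise.
collisions$is_weekend <- as.integer(collisions$day_of_week >= 5)

print(collisions[, c("date", "hour", "day_of_week", "is_weekend")])
```

Using `%u` rather than `weekdays()` keeps the result independent of the system locale.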

3. Data Quality Study/ Sanity Check

3.1. Missing Value analysis

3.1.1 Overall missing pattern:

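library(dplyr)   # filter(), arrange(), select(), mutate() and %>%
library(skimr)   # skim(); the long stat/value output of skimr 1.x is assumed here
skimr::skim(df) %>% filter(stat == "missing") %>%
arrange(desc(value)) %>% select(variable, type, value) %>% mutate(percent = value/nrow(df))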
## # A tibble: 59 x 4
##    variable                      type      value percent
##    <chr>                         <chr>     <dbl>   <dbl>
##  1 VEHICLE.TYPE.CODE.5           factor  1241099   0.997
##  2 CONTRIBUTING.FACTOR.VEHICLE.5 factor  1240976   0.997
##  3 VEHICLE.TYPE.CODE.4           factor  1228605   0.987
##  4 CONTRIBUTING.FACTOR.VEHICLE.4 factor  1227984   0.986
##  5 VEHICLE.TYPE.CODE.3           factor  1167357   0.937
##  6 CONTRIBUTING.FACTOR.VEHICLE.3 factor  1164932   0.935
##  7 OFF.STREET.NAME               factor  1048226   0.842
##  8 ZIP.CODE                      numeric  356753   0.286
##  9 BOROUGH                       factor   356613   0.286
## 10 CROSS.STREET.NAME             factor   317016   0.255
## # ... with 49 more rows
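
The same kind of ranking can also be obtained with base R only. The sketch below uses a small made-up data frame (not the collisions data) to show the idea; the object names `toy`, `missing_count` and `missing_share` are illustrative.

```r
# Toy data frame with missing values; stands in for the collisions data.
toy <- data.frame(
  BOROUGH  = c("QUEENS", NA, "BRONX", NA),
  ZIP.CODE = c(11101, NA, 10451, 10001),
  DATE     = c("01/01/2018", "01/02/2018", "01/03/2018", "01/04/2018"),
  stringsAsFactors = FALSE
)

# Count NAs per column, convert to a share of rows, and sort descending.
missing_count <- colSums(is.na(toy))
missing_share <- sort(missing_count / nrow(toy), decreasing = TRUE)
print(missing_share)
```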

3.1.2 What is the missing pattern of longitude and latitude?

3.1.2.1 Are they missing at random (MAR)?

The image below (produced with missing_data.frame) suggests that the missingness of longitude and latitude is not related to the values of other variables. To verify this, we plotted a more detailed graph to see whether missing latitude was related to the value of TIME. Our reasoning was that if an accident happened late at night, the latitude would be more likely to be missing.

library(mi)  # provides missing_data.frame() and its image() method
x <- missing_data.frame(sample_vehicle)
image(x)

However, as shown in the picture below, the value of TIME was not related to the missing pattern of latitude.

x <- missing_data.frame(sample_vehicle[,c("TIME","LATITUDE")])
image(x)
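
A numeric check can complement the image: compute the share of rows with a missing latitude within each hour. If missingness were unrelated to TIME, the shares should be similar across hours. The sketch below runs on invented data, and the `hour` column (derived from TIME) is an assumption.

```r
# Toy sample; the real check would use the sampled collisions data.
sample_toy <- data.frame(
  hour     = c(2, 2, 2, 2, 14, 14, 14, 14),
  LATITUDE = c(40.7, NA, 40.6, 40.8, NA, 40.7, 40.9, 40.6)
)

# Share of rows with a missing latitude within each hour.
missing_by_hour <- tapply(is.na(sample_toy$LATITUDE), sample_toy$hour, mean)
print(missing_by_hour)
```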

3.1.2.2 Are they missing not at random (MNAR)?

We used the following two pieces of data to verify:

  • A: the original dataset

  • B: the subset of the dataset that is missing latitude and longitude but not borough

This animated gif shows the difference between the distribution of accidents (in percent) over boroughs in dataset A and in dataset B: https://github.com/mxc19912008/readme_pics/blob/master/image/Missing%20latitude1.gif The percentage tables below show the same difference: the shares of both the BRONX and QUEENS are around 2 percentage points higher when the latitude is missing.

Because the missingness varies with borough, we conclude that longitude and latitude are indeed missing not at random (MNAR).

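# Rows with LATITUDE missing but BOROUGH present; columns are named rather than indexed, and df supplies both mask and data
missing_lat_not_missing_borough <- subset(df, is.na(LATITUDE) & !is.na(BOROUGH))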

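# Baseline: share of accidents per borough in the full data
barplot(prop.table(table(df$BOROUGH)),
        main = "Baseline: Original borough distribution",
        xlab = "Borough",
        ylab = "Share of accidents",
        border = "red",
        col = "blue",
        density = 10)

# Same shares, restricted to rows where latitude is missing
barplot(prop.table(table(missing_lat_not_missing_borough$BOROUGH)),
        main = "Borough distribution when latitude is missing",
        xlab = "Borough",
        ylab = "Share of accidents",
        border = "red",
        col = "blue",
        density = 10)

# The two percentage tables behind the plots
prop.table(table(df$BOROUGH))
prop.table(table(missing_lat_not_missing_borough$BOROUGH))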
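
One way to judge whether a shift of a couple of percentage points is more than chance is a chi-squared goodness-of-fit test of the subset's borough counts against the baseline shares. This test is not part of the original analysis, and the counts below are invented for illustration. Since subset B is drawn from dataset A, the result should be read as a rough check only.

```r
# Invented borough counts for illustration only (not the real figures).
baseline_counts <- c(BRONX = 1000, BROOKLYN = 2400, MANHATTAN = 2000,
                     QUEENS = 2000, "STATEN ISLAND" = 600)
subset_counts   <- c(BRONX = 130, BROOKLYN = 230, MANHATTAN = 190,
                     QUEENS = 220, "STATEN ISLAND" = 30)

# Baseline shares act as the expected distribution.
baseline_share <- baseline_counts / sum(baseline_counts)

# Goodness-of-fit test: do the subset counts follow the baseline shares?
fit <- chisq.test(x = subset_counts, p = baseline_share)
print(fit$statistic)
print(fit$p.value)
```

A small p-value would indicate that the borough mix among rows with missing latitude differs from the overall mix.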

3.1.3. Column and row based missing pattern:

The figure below shows an obvious column-wise pattern of missing values. The share of missing data is nearly the same for contributing factor vehicle 1 and vehicle type code 1, and the same holds for contributing factor vehicle 2 and vehicle type code 2, and so on (for example, about 99.7% for both vehicle 5 columns). The reason is that the two features describe the same vehicle. Another obvious pattern is that the share of missing values increases with the vehicle number, because collisions involving more vehicles are rarer. Date and time have no missing values. Missing values also appear across the location-related features, including longitude, latitude, and location. One possible reason is a lack of records: for example, vehicles might have driven away too fast to be recorded. In terms of row-wise patterns, the most frequent one is a two-car collision with no missing values except in the fields for additional vehicles.

visna(vehicle, sort = "r")

3.2. Data Sanity

We observed that the top category of the important feature “Vehicle Contributing Factor”, which tells us why the accident happened, was “Unspecified.” We suspected that this could be hiding some important information. Is it “completely at random”?

Hypothesis 1: Details of “minor” accidents occurring at night were more likely not to be captured fully. Is this the case?

Finding: If at all, only very slightly.

Hypothesis 2: Details of accidents (occurring at night or not) were more likely not to be captured fully. Is this the case?

Finding: If at all, only very slightly.

So the “Unspecified” values are more likely to be completely at random.
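
A check of this kind can be sketched as the share of “Unspecified” factors at night versus during the day. The data below is invented; the `hour` column, the night window (10pm to 5am) and the object names are assumptions, not the project's actual definitions.

```r
# Toy records; the factor labels are examples only.
toy <- data.frame(
  hour    = c(1, 3, 23, 2, 9, 12, 15, 18),
  factor1 = c("Unspecified", "Driver Inattention/Distraction", "Unspecified",
              "Unspecified", "Unspecified", "Failure to Yield Right-of-Way",
              "Unspecified", "Driver Inattention/Distraction"),
  stringsAsFactors = FALSE
)

# Night is taken here as 10pm to 5am; this cutoff is an assumption.
toy$night <- toy$hour >= 22 | toy$hour < 5

# Share of records whose first contributing factor is "Unspecified".
unspecified_share <- tapply(toy$factor1 == "Unspecified", toy$night, mean)
print(unspecified_share)
```

If the two shares are close on the real data, as the findings above suggest, time of day does little to explain the “Unspecified” entries.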

Now we come to the crux of the report, the actual analysis. Bored already? Don’t worry, I shall keep you interested with my humour (or lack thereof).

4. Main: Exploratory Data Analysis

4.1 Study 1: Time and Space study

Hypothesis 1: Intuitively, the number of accidents varies greatly with the time of day (yes, even for the city that never sleeps)

  1. Total number of car accidents in NYC on weekdays and weekends

Let’s look at the number of car accidents in the city and see whether collisions on weekdays differ from those on weekends.

Before analyzing the data, my initial assumption was that more accidents would happen during rush hour on weekdays, which was roughly correct. On weekdays, the highest peaks occurred around 7am to 9am and 4pm to 7pm. On weekends, accident counts were high in the afternoon, around 1pm to 5pm. Since there is no commonly agreed weekend rush hour in NYC, it is hard to say whether this range counts as one. However, I’d like to note that the weekend afternoon followed a similar trend to the weekday rush hour, especially the peaks between 1pm and 3pm and between 3pm and 5pm.
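
The comparison behind this finding amounts to counting collisions by hour, separately for weekdays and weekends. The sketch below shows the idea on invented data; the columns `hour` and `is_weekend` are assumed labels of the kind described in the data section.

```r
# Toy records: one row per collision.
toy <- data.frame(
  hour       = c(8, 8, 8, 17, 17, 14, 14, 14, 8),
  is_weekend = c(0, 0, 0, 0, 0, 1, 1, 1, 1)
)

# Cross-tabulate: rows are weekday (0) / weekend (1), columns are hours.
counts <- table(weekend = toy$is_weekend, hour = toy$hour)
print(counts)

# Hour with the most collisions in each group.
peak_hour <- apply(counts, 1, function(row) names(row)[which.max(row)])
print(peak_hour)
```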

Hypothesis 2: Trend of total collisions

Looking at each borough, we could tell that more accidents occurred in Brooklyn, Manhattan, and Queens than in the Bronx and Staten Island. Note that NA refers to records with a missing borough.

From here, I decided to look more closely at the data by geolocation, to see where the accidents happened during rush hour.

One thing to note is that because many latitude and longitude values are missing, not all accidents can be placed on the map. Further analysis of the missing values can be found in the data quality section.

Back to the analysis: first, I separated the data by the number of persons killed and injured and placed them on the map.

For weekdays, I showed the locations of persons injured and killed during 7am ~ 9am and 4pm ~ 7pm.

It is clear that more injuries and deaths were associated with accidents in the evening than with those in the morning.

For weekends, I showed the locations of persons injured and killed during 1pm ~ 5pm.

Not surprisingly, both the number of injured and the number of killed people were lower on weekends than on weekdays. From here, I wanted to explore whether these locations matched any congestion areas in NYC.

Hypothesis 3: Congestion areas and fatal collisions.

  1. Trend of fatalities during rush hour

Since the graphs above showed more deaths in the evenings, I focused on them to test whether any locations matched congestion areas during 4pm ~ 7pm.

Again, I excluded weekend accidents and the weekday morning rush hour because a) the purpose of this analysis was to identify whether congestion areas had any influence on fatal collisions and b) their numbers were not sufficient to test my hypothesis.

  2. For weekdays, I showed the locations of persons killed during 4pm ~ 7pm

  3. A map of each borough

I zoomed in to look at the boroughs one by one, which helped me identify the congestion areas.

  • Manhattan

Considering only the weekday evening rush hour, I could not draw a conclusion about the relationship between congestion areas and death counts with the information I had.

  • Queens

Near I-495, the Horace Harding Expressway and the Long Island Expressway had high death counts, as these are among the worst traffic corridors.

  • Brooklyn

Though fatal collisions were widespread, I was able to identify death counts in some areas, such as Bedford-Stuyvesant, Flatbush Avenue near Ditmas Park, and Sheepshead Bay Road, which are known for traffic congestion problems.

  • The Bronx

It was no surprise to see that the area near East 161st Street, by Yankee Stadium and the Bronx courthouses, had double death counts, since it is notorious for traffic problems, including congestion and double parking. Many plans have been rolled out to improve conditions on this busy road, such as launching the Bx6 Select Bus Service.

  • Staten Island

Hylan Boulevard, Staten Island’s longest commercial roadway, serves as one of the borough’s primary roadways. Due to the nature and function of this corridor, Hylan Boulevard is frequently congested on weekdays, and it marked the highest death counts on the map.

4.2 Study 2: Impact of National Holidays on Traffic & Accidents

Summary: Holidays, yay! There is one more reason to cheer for them! The dates with the fewest accidents are holidays on which shops and other businesses close: New Year’s Day, National Day, and Christmas. One extreme outlier in the plot is Feb 29, which occurs only in a leap year (2016 in our data).

ggplot(data = date_of_year, aes(x = doy, y = n, group = 1)) + 
    geom_line() + 
    scale_x_discrete(breaks = 10) + 
    labs(title = "Accidents Year Round Distribution", x = "Date", y = "Count of Accidents") +
    geom_text(aes(label=ifelse(n<2100,doy,'')),nudge_x = -5, nudge_y = 0) 
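
The plot above expects a table with one row per calendar day and columns `doy` and `n`. How `date_of_year` was actually built is not shown, so the following is only a sketch of one possible construction on invented dates. It also shows why Feb 29 stands out: it can only collect accidents from leap years.

```r
# Toy collision dates spanning several years.
dates <- as.Date(c("2016-01-01", "2017-01-01", "2016-02-29",
                   "2016-07-15", "2017-07-15", "2018-07-15"))

# Month-day label shared across years, e.g. "07-15".
doy <- format(dates, "%m-%d")

# One row per calendar day with the number of collisions on that day.
date_of_year <- as.data.frame(table(doy), stringsAsFactors = FALSE)
names(date_of_year) <- c("doy", "n")
print(date_of_year)
```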

4.3 Study 3: Study of Alcohol Involvement

Hypothesis 1: Accidents involving alcohol are more likely to occur on weekends

Finding: When we restricted the accidents to those with alcohol involvement, the difference between weekdays and weekends was pretty clear.

Our hypothesis was that alcohol consumption causes more accidents on weekends than on weekdays, and the following plot supports it. We can also see the number of accidents rising as the weekend approaches. The result shows a relatively high number of accidents on Friday, which is both the end of the workweek and the beginning of the weekend.

# Rows where any of the five contributing-factor columns reports alcohol.
# %in% treats NA as FALSE, so rows with missing factors do not slip in as all-NA rows.
factor_cols <- paste0("CONTRIBUTING.FACTOR.VEHICLE.", 1:5)
alcohol <- df[rowSums(sapply(df[factor_cols], `%in%`, "Alcohol Involvement")) > 0, ]

ggplot(data = alcohol, aes(Day.of.week, group = 1)) + 
    geom_histogram(stat = "count") + 
    ggtitle("Alcohol Involvement Accident Weekly Pattern") +
    scale_x_discrete(limits = seq(0,6),labels = c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"))

Hypothesis 2: Accidents involving alcohol are more likely to occur late at night

Finding: Accidents increased from the evening onward and peaked late at night.

Without any filter on the contributing factor, the hourly distribution of accidents is a skewed bell curve with rush-hour peaks, as shown in the previous study. Once we narrowed down to alcohol-related accidents, the trend was entirely the opposite. As the following plot shows, alcohol-involved cases are at their lowest during the day and climb gradually from 3pm to midnight. Late night (from 12am to 5am) is the peak for accidents with alcohol involvement, which is consistent with late-night drinking contributing to accidents.

# Rows where any contributing-factor column reports alcohol (%in% treats NA as FALSE).
factor_cols <- paste0("CONTRIBUTING.FACTOR.VEHICLE.", 1:5)
alcohol <- df[rowSums(sapply(df[factor_cols], `%in%`, "Alcohol Involvement")) > 0, ]
ggplot(data = alcohol, aes(hour)) + 
    geom_histogram(stat = "count") + 
    ggtitle("Alcohol Involvement Accident hourly Pattern")

Hypothesis 3: Accidents involving alcohol will be clustered in Midtown and Downtown Manhattan

Finding: From the previous hypotheses, we found that accidents with alcohol involvement are concentrated on weekends and around midnight. Now we focus on where these accidents are located.

library(ggmap)  # provides qmap()
# Rows where any contributing-factor column reports alcohol (%in% treats NA as FALSE).
factor_cols <- paste0("CONTRIBUTING.FACTOR.VEHICLE.", 1:5)
alcohol <- df[rowSums(sapply(df[factor_cols], `%in%`, "Alcohol Involvement")) > 0, ]

ManhattanMap <- qmap(location = "Manhattan", zoom = 11, color = "bw")
ManhattanMap + 
    geom_point(aes(x = LONGITUDE, y = LATITUDE), color = "gold", alpha = 0.1, data = alcohol) +
  ggtitle("Alcohol Involvement on weekends and during midnight") + 
  labs(x="Longitude", y = "Latitude")
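
The map title refers to weekends and late night. One way to restrict alcohol-involved rows to that window is sketched below on toy data. The weekend definition (`Day.of.week` of 5 or 6, with 0 = Monday) and the late-night window (before 5am) are assumptions read off the axis labels and text above, not the project's confirmed code.

```r
# Toy records with missing contributing factors, as in the real data.
toy <- data.frame(
  Day.of.week = c(5, 6, 2, 5, 0),
  hour        = c(1, 23, 2, 14, 3),
  CONTRIBUTING.FACTOR.VEHICLE.1 = c("Alcohol Involvement", "Unspecified",
                                    "Alcohol Involvement", "Alcohol Involvement", NA),
  CONTRIBUTING.FACTOR.VEHICLE.2 = c(NA, "Alcohol Involvement", NA, NA, NA),
  stringsAsFactors = FALSE
)

# TRUE when any contributing-factor column reports alcohol.
# %in% returns FALSE for NA, whereas == would return NA and pull in empty rows.
factor_cols <- grep("^CONTRIBUTING\\.FACTOR", names(toy), value = TRUE)
has_alcohol <- rowSums(sapply(toy[factor_cols], `%in%`, "Alcohol Involvement")) > 0

is_weekend <- toy$Day.of.week >= 5   # Saturday or Sunday (assumed coding)
late_night <- toy$hour < 5           # 12am to 5am (assumed window)

weekend_night <- toy[has_alcohol & is_weekend & late_night, ]
print(weekend_night[, c("Day.of.week", "hour")])
```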

4.4 Study 4: Effect of Climate!

We as New Yorkians (this sounds way cooler, right? Never really liked Yorkers) are so flustered by the weather. We fall in love with it, go outside, and start hating it the next moment! Weather is such an integral part of human society that it is the topmost factor to consider, always. So, let’s see how weather has an effect, correlation of course (do I dare say causation?), on accidents! (The effect of weather on pedestrians comes later.) Don’t worry, you won’t be bored. I will keep you interested with my humour, or lack thereof.

Hypothesis 1: Temperature

Finding: Looks like our dear temperature does not have much correlation with accidents. Read on to find out more …

There does not seem to be much effect of temperature on accidents, injuries or deaths. Just to be sure, let’s look at the density plots.

Hypothesis 2: Snowfall

Finding: Snowfall has a definite correlation with the number of accidents and the number of injuries!

Ah dear Lord Snow, you know nothing and are always butting into things! (Get it? No??? Watch Game of Thrones now!) Looks like the higher the snowfall, the higher the number of accidents and injuries! Let’s confirm this by comparing density plots. Lord Snow is compassionate enough to come only a few times a year, hence not many deaths on the plot (insufficient data!).

(Bored already? Some humour: I call the above plot “Manhattan Graph”. Quick, spot the World Trade Center!!)

Hypothesis 3: Rain! Rain! Rain!

Finding: Like his sister Snow, Rain has a correlation with accidents, injuries and even deaths. Read on …

Can’t make out anything from this! But I was so sure that rain makes driving and walking difficult, and hence was hoping it would result in more accidents! (Beware: sadistic tendencies :/ ) Now, let’s see whether my favorite density plots will unveil some abracadabra!

Hypothesis 4: Wind and Fog

Finding: Contrary to my belief, wind and fog don’t affect accidents much. Now, believe me when I say we have enough data! “Let’s speed on to the density plots, I say”
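
The weather comparisons in this study rest on joining daily accident counts to daily weather records. A minimal sketch of that join is shown below with invented numbers; the data frames `daily` and `weather` and the `date` key are hypothetical, and only `SNOW` is a column name taken from the data description.

```r
# Invented daily accident counts.
daily <- data.frame(
  date      = as.Date("2018-01-01") + 0:5,
  accidents = c(600, 640, 700, 610, 720, 590)
)

# Invented daily weather; SNOW is snowfall in mm.
weather <- data.frame(
  date = as.Date("2018-01-01") + 0:5,
  SNOW = c(0, 20, 60, 0, 80, 0)
)

# Join on the date, then compare accident counts against snowfall.
merged <- merge(daily, weather, by = "date")
cor_snow <- cor(merged$accidents, merged$SNOW)

# Mean accidents on days without snow (FALSE) and with snow (TRUE).
mean_by_snow <- tapply(merged$accidents, merged$SNOW > 0, mean)

print(cor_snow)
print(mean_by_snow)
```

A correlation computed this way describes association only; it does not by itself show that snow causes the extra accidents.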

4.5 Study 5: What impacts accidents involving pedestrians?

Now we come to, in my opinion, the most impactful part of this analysis. In this world, we have to strive to make lives easier for people, and a main part of that is that no one should be physically hurt, at the least! Thanks to technological advances in car safety features, accident-related injuries have dropped drastically since the early 90’s. However, we homo sapiens have been awaiting an upgrade in body structure (BS, I tell you) from our beloved Lord Thor for >5000 years now. So, if we want to become greener and start walking (or taking public transport), we have to start wearing helmets while walking (haha), or carefully follow whatever I say below (I mostly say, “Stay at home, grab a cup of coffee and a novel, the surest way of being peaceful in life”)!

Hypothesis 1: Pedestrians are more likely to be hit at night, when it is difficult to spot people.

Finding: They are indeed! Of special concern are pedestrian deaths, which seem to increase disproportionately at night.

You see the shifting peaks? It is highly likely that at night it is very difficult to spot people on the road, and by the time you spot someone, it is already too late! So many deaths at night! Be careful people, especially when the Sun goes down. Follow the traffic rules and be safe!
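
The day/night comparison for pedestrians can be sketched as follows. The data is invented; the column names follow R-style versions of the dataset's pedestrian counts and are assumptions here, as is the night window (8pm to 6am).

```r
# Toy records: one row per collision.
toy <- data.frame(
  hour = c(2, 3, 9, 13, 18, 21, 23, 15),
  NUMBER.OF.PEDESTRIANS.INJURED = c(1, 0, 2, 1, 2, 1, 0, 2),
  NUMBER.OF.PEDESTRIANS.KILLED  = c(1, 1, 0, 0, 0, 1, 1, 0)
)

# Night is taken here as 8pm to 6am; this cutoff is an assumption.
toy$night <- toy$hour >= 20 | toy$hour < 6

# Total pedestrian injuries and deaths by day (FALSE) and night (TRUE).
injured_by_night <- tapply(toy$NUMBER.OF.PEDESTRIANS.INJURED, toy$night, sum)
killed_by_night  <- tapply(toy$NUMBER.OF.PEDESTRIANS.KILLED, toy$night, sum)

print(injured_by_night)
print(killed_by_night)
```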

Hypothesis 2: A side study, inspired by the above. Injuries and accidents should be more rampant at night too!

Finding: Oh surely! Look at the disproportionate deaths late at night! A likely explanation is that people cannot concentrate from 12AM-4AM (sleep cycle!) and reaction times slow dramatically then. This might lead to more “severe” accidents.

Drivers, do you see that!? That is why my incredibly smart Mom always told me to be careful at night! Please avoid driving from 12AM-4AM and, if unavoidable, get some coffee! (I’ll fund it myself if you want!)

Finding: Turns out low temperatures and heavy snow correlate highly with pedestrian accidents, injuries and deaths. Rain and fog, not so much.

1) Temperature and Pedestrians

2) Snow and Pedestrians

Snow and pedestrians, a better love story than Twilight :D

3) Rain and Pedestrians

Not much of a correlation.

5. Learnings from the Interactive Map (Interactive Component)

Full version: https://github.com/mxc19912008/NYC-Accidents-Explorer

Interactive Shiny App: https://xiaochunma.shinyapps.io/NYPD_accidents_shiny/

5.1 Introduction

The NYPD adds and updates records each month in the NYPD Motor Vehicle Collisions dataset (https://data.cityofnewyork.us/Public-Safety/NYPD-Motor-Vehicle-Collisions/h9gi-nx95). Each record represents a collision in NYC by city, borough, precinct and cross street.

The objective was to make this data interactively visible, so that users can explore patterns in accidents, injuries, and deaths.

For this final project, we developed a Shiny application that lets users work with an interactive map and see the locations and relative numbers of accidents, injuries and deaths that match a user-specified selection of criteria. To keep the map quick to render and use, the application works with the latest 10,000 records in 2018. Instructions, the proposed analysis, and hypotheses are provided in this final summary.
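
Selecting the most recent records for the app could look like the sketch below. It runs on a toy data frame with `n_keep` set to 2 instead of 10,000; the `DATE` format and the object names are assumptions, not the app's actual code.

```r
# Toy stand-in for the collisions data.
toy <- data.frame(
  DATE = c("01/05/2018", "12/30/2017", "03/01/2018", "02/14/2018"),
  stringsAsFactors = FALSE
)
toy$date <- as.Date(toy$DATE, format = "%m/%d/%Y")

n_keep <- 2  # the app used 10,000

# Keep 2018 records only, newest first, then take the first n_keep rows.
records_2018 <- toy[format(toy$date, "%Y") == "2018", ]
latest <- head(records_2018[order(records_2018$date, decreasing = TRUE), ], n_keep)
print(latest)
```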

5.2 Proposed Analysis and Hypothesis

Task 1: Visualize the accidents/injuries/death on the map:

Users can choose the index in the Accidents Explorer:

  • Accidents
  • Injuries
  • Death

The corresponding dots then show up on the map. With the legend in the bottom left corner, users can easily see which color represents which number of accidents/injuries/death. Users can also choose “Add Cluster” to visualize how the accidents/injuries/death cluster on the map.

Hypothesis 1: Busier areas (Midtown, Downtown and Harlem) have more accidents

The picture shows that these three areas have more accidents than other areas, so this hypothesis is supported.

Hypothesis 2: Large-scale injuries tend not to happen in Manhattan, since the speed limits are lower and this district has more traffic lights.

The picture shows that large-scale injuries (marked by the red arrow) tend to happen in other districts, so this hypothesis is also supported.

Task 2: Visualize the accidents/injuries/death for each type of vehicle on the map:

Other than choosing All Vehicles, users can choose one of these vehicles to visualize:

  • Ambulance
  • Bicycle
  • Bus
  • Fire Truck
  • Large Commercial Vehicle
  • Livery Vehicle
  • Motorcycle
  • Passenger
  • Pick-up Truck
  • Scooter
  • Small Commercial Vehicle
  • Sport Utility/Station Wagon
  • Taxi
  • Van

Task 3: Visualize the accidents/injuries/death across time (hour) on the map:

Users can also see the number of accidents/injuries/death by hour in the side column, and in this way check how they are distributed over time.

Hypothesis 3: More accidents happen during rush hours (7-9, 16-18) than at other times.

The bar chart shows that more accidents happened during rush hours (7-9, 16-18) than in other time slots, so this hypothesis is also supported.

5.3 What can be done with our analysis?

  • Send more police to the busier areas;
  • Send more police during rush hours;
  • Reduce speed limit for spots that have more injuries.

5.4 Future work

One main thing I wanted to do was get sunrise and sunset times and study accidents in the dark (I wish it was India and I could just say anything from 6:30PM-6:30AM is dark!). I even found an algorithm that does this using lat-long, but did not have enough time to implement it.

We could incorporate major events and activities (concerts, games, political events, etc.) to see whether accidents are correlated with them.

We could also have made the time period interactively selectable, so that we could learn where to send more police in which time periods.

Because the data is so large, we could not put all of it in the interactive app. Maybe there is a way to overcome that.

6. Executive Summary

As residents of New York City, we often talk about its public safety, and the first topic that comes to mind is commuting safety. People are curious about the number of vehicle collisions and how they are distributed over time periods, areas, specific landscapes, vehicle types, contributing factors, etc. We may be even more eager to find something beyond common sense. Are accidents more likely during holidays, on weekdays, late at night, downtown, or in severe weather? We tested these hypotheses, and the findings follow.

Our first concern is the time period. The number of accidents differs across time spans. Our first hypothesis is supported: more accidents happen during rush hours on weekdays than on weekends. This is consistent with common sense, since rush hour comes from workdays and commuting is the direct explanation. The plot shows two peaks on workdays, 7am to 9am and 4pm to 7pm, which indicates that rush hours drive up the number of accidents. Hence we should be careful about heavy traffic and the potential for accidents during those periods. The data also showed that most fatal accidents happen from 12AM-4AM, which matches our intuition that this is when most people find it difficult to concentrate!

Once we know the effect of rush hours, our next focus is the boroughs where these accidents took place. Our second hypothesis is that there are more accidents in Brooklyn, Manhattan, and Queens than in the Bronx and Staten Island, which is also supported in the main analysis. Looking into severity, we move from the number of accidents to the counts of injuries and deaths. There are more injuries and deaths in the evening than in the morning. Severe accidents are also more likely to happen outside Manhattan; a likely reason is that boroughs like Queens have highways with higher speed limits, which could lead to more severe accidents (quantified by the number of deaths and injuries).

Within certain boroughs, there are particular areas with much more activity than others, which we call congestion areas in this study. Since Manhattan is congested in every respect and stands out among all the boroughs, the only suggestion we can make is that all of Manhattan be considered a congestion area. In Queens, some expressways need attention since they have relatively high death counts. The noticeable areas in Brooklyn and the Bronx are both known for traffic congestion problems. The busy roadway in Staten Island is also a congestion area because of its role as a corridor.

After looking at the daily and regional spread of traffic accidents, we looked at accidents related to specific activities or events. Holidays are another special time. When accidents are arranged by day of the year, specific national holidays, namely New Year’s Day, National Day, and Christmas, have significantly fewer accidents. This may go against common sense, but it is related to reduced business activity and commuting. These three days are special among holidays, since shops and businesses are closed. We can expect that people stay at home, which leads to a low number of accidents.

Drinking-related accidents are also a special case, since drinking easily leads to accidents. There are clearly more accidents with alcohol involved on weekends than on weekdays, and the increase starts on Friday, since Friday evening is the start of the weekend. In addition, drinking-related accidents occur more frequently late at night, from midnight to 5am, which is consistent with drunk driving causing these accidents. When we put these accidents on the map, they converge on the Midtown to Downtown area of Manhattan. This also makes sense because late-night bars cluster in the same area. As an example, we show this study here: (Note that total accidents peaked on weekdays and during rush hours)

theme_update(plot.title = element_text(hjust = 0.5))

# Rows where any of the five contributing-factor columns reports alcohol.
# %in% treats NA as FALSE, so rows with missing factors do not slip in as all-NA rows.
factor_cols <- paste0("CONTRIBUTING.FACTOR.VEHICLE.", 1:5)
alcohol <- df[rowSums(sapply(df[factor_cols], `%in%`, "Alcohol Involvement")) > 0, ]

ggplot(data = alcohol, aes(Day.of.week, group = 1)) + 
    geom_histogram(stat = "count") + 
    ggtitle("Increased alcohol related incidents in the weekend") + labs(x="Day of week", y="Number of Accidents") +
  scale_x_discrete(limits = seq(0,6),labels = c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"))

theme_update(plot.title = element_text(hjust = 0.5))

# Rows where any contributing-factor column reports alcohol (%in% treats NA as FALSE).
factor_cols <- paste0("CONTRIBUTING.FACTOR.VEHICLE.", 1:5)
alcohol <- df[rowSums(sapply(df[factor_cols], `%in%`, "Alcohol Involvement")) > 0, ]
ggplot(data = alcohol, aes(hour)) + 
    geom_histogram(stat = "count") + 
    ggtitle("Increased alcohol related incidents late night")+labs(x="Hour of Day", y="Number of Accidents") 

Weather is always a good indicator for traffic, and so it is for vehicle accidents. With the external weather dataset, we found that temperature, wind and fog do not have much effect on traffic accidents. However, snowfall and rain do affect the number of accidents, as well as injuries. Low temperatures and snow have a particularly high impact on pedestrians, as shown. Beyond the graphs, it is easy to picture severe weather causing traffic accidents, since it makes driving hard. Also, pedestrian-related accidents are more likely at night than during the day.

To show the data dynamically, we also built a map in Shiny to explore the traffic accidents interactively. The options include the distribution of accidents/injuries/death, clustering, types of vehicles, hours, etc. To wrap up, here are some practical safety suggestions for avoiding accidents in NYC: 1) avoid rush hours; 2) be careful in congestion areas; 3) avoid streets with bars, especially late at night on weekends; 4) be cautious in severe weather.

7. Conclusion

We are happy that we selected such a (potentially) impactful dataset. We studied the data, incorporated external weather data, and did a rigorous data sanity check. We then made several hypotheses, checked each of them methodically, and came to conclusions. We are satisfied that, based on just this small initiative, travel in NYC can become safer!

We concluded that weather, time of day, special occasions and alcohol involvement are very much correlated with the number of accidents and hence the number of injuries and fatalities. We, as citizens, have to understand these factors and be careful, thereby keeping others safe.

In this final project, the limitations come from two aspects. 1. The size of our dataset, which made the main analysis harder to conduct. For example, in the missing data analysis, the rows were so compressed together that we could not see the pattern, so we could only sample some rows or columns for analysis. 2. The incorporation of external datasets: we originally had more hypotheses that required other datasets to find the reasons for certain accidents. We found weather data on the Global Historical Climatology Network website, but other data, such as highway speed data, was hard to find and hard to join to our dataset. Thus, several other hypotheses were not tested.

Future directions: 1. Use Hadoop/Spark to handle large data when doing big-data visualization. For example, this resource shows how to do data visualization in the browser with Spark: https://databricks.com/session/visualizing-big-data-in-the-browser-using-spark. 2. Spend more time on data integration so that we can use more external data sources to validate our hypotheses.

We learned a lot in the process of this project. In general, we learned that validating a hypothesis and convincing others is not easy. We need to ensure data quality and step-by-step reasoning, and show our audience intriguing visualizations, to convince others that our hypothesis holds. Besides, the purpose of the whole project is to validate our hypotheses rather than to show good-looking pictures; if a simple picture is clear enough to support a hypothesis, adding other elements to it is not a good idea.